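Transformer verification is drawing increasing attention in machine learning research and industry. It formally verifies the robustness of transformers against adversarial attacks, such as exchanging words in a sentence with synonyms. However, the performance of transformer verification is still unsatisfactory because of its bound-centric computation, which differs significantly from that of standard neural networks. In this paper, we propose Faith, an efficient framework for transformer verification on GPUs. First, we propose a semantic-aware computation graph transformation that identifies semantic information, such as bound computation, in transformer verification. We exploit this semantic information to enable efficient kernel fusion at the computation graph level. Second, we propose a verification-specialized kernel crafter to efficiently map transformer verification to modern GPUs. The crafter exploits a set of GPU hardware features to accelerate verification-specialized operations, which are usually memory-intensive. Third, we propose expert-guided autotuning that incorporates expert knowledge of GPU backends to facilitate exploration of the large search space. Extensive evaluations show that Faith achieves a $2.1\times$ to $3.4\times$ ($2.6\times$ on average) speedup over state-of-the-art frameworks.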
translated by 谷歌翻译
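The increasing size of input graphs for graph neural networks (GNNs) highlights the demand for multi-GPU platforms. However, existing multi-GPU GNN solutions suffer from inferior performance due to imbalanced computation and inefficient communication. To this end, we propose MGG, a novel system design that accelerates GNNs on multi-GPU platforms via a GPU-centric software pipeline. MGG explores the potential of hiding remote memory access latency in GNN workloads through fine-grained computation-communication pipelining. Specifically, MGG introduces a pipeline-aware workload management strategy and a hybrid data layout design to facilitate communication-computation overlapping. MGG implements an optimized pipeline-centric kernel. It includes workload interleaving and warp-based mapping for efficient pipelining of GPU kernel operations, together with specialized memory designs and optimizations for better data access performance. In addition, MGG incorporates lightweight analytical modeling and optimization heuristics to dynamically improve GNN execution performance for different settings at runtime. Comprehensive experiments demonstrate that MGG outperforms state-of-the-art multi-GPU systems across various GNN settings: it is on average 3.65x faster than multi-GPU systems with a unified virtual memory design and on average 7.38x faster than the DGCL framework.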
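Recent research on 3D point cloud semantic segmentation has achieved outstanding performance by adopting deep CNNs (convolutional neural networks) and GCNs (graph convolutional networks). However, the robustness of these complex models has not been systematically analyzed. Given that semantic segmentation is applied in many safety-critical applications (e.g., autonomous driving, geological sensing), it is important to fill this knowledge gap, in particular regarding how these models are affected by adversarial samples. While adversarial attacks against point clouds have been studied, we find that all of them target single-object recognition and apply the perturbation to the point coordinates. We argue that coordinate-based perturbations are unlikely to be realizable under physical-world constraints. We therefore propose a new color-only perturbation method named COLPER and tailor it to semantic segmentation. By evaluating COLPER on an indoor dataset (S3DIS) and an outdoor dataset (Semantic3D) against three point cloud segmentation models (PointNet++, DeepGCNs, and RandLA-Net), we find that color-only perturbation is sufficient to significantly reduce segmentation accuracy and aIoU under both targeted and non-targeted attack settings.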
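Recently, graph neural networks (GNNs), the backbone of graph-based machine learning, have demonstrated great success in various domains (e.g., e-commerce). However, the performance of GNNs is usually unsatisfactory because of their highly sparse and irregular graph-based operations. To this end, we propose TC-GNN, the first GNN acceleration framework based on GPU Tensor Core Units (TCUs). The core idea is to reconcile the "sparse" GNN computation with the "dense" TCU. Specifically, we conduct an in-depth analysis of the sparse operations in mainstream GNN computing frameworks. We introduce a novel sparse graph translation technique that facilitates TCU processing of sparse GNN workloads. We also implement an effective CUDA core and TCU collaboration design to fully utilize GPU resources. We fully integrate TC-GNN with the PyTorch framework for ease of programming. Rigorous experiments show an average 1.70x speedup over the state-of-the-art Deep Graph Library framework across various GNN models and dataset settings.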
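Variational quantum algorithms are expected to demonstrate the advantage of quantum computing on near-term noisy quantum computers. However, training such variational quantum algorithms suffers from gradient vanishing as the size of the algorithm increases. Previous work cannot handle the gradient vanishing induced by the inevitable noise effects on realistic quantum hardware. In this paper, we propose a novel training scheme to mitigate such noise-induced gradient vanishing. We first introduce a new cost function whose gradients are significantly augmented by employing traceless observables in a truncated subspace. We then prove that the same minimum can be reached by optimizing the original cost function with the gradients from the new cost function. Experiments show that our new training scheme is highly effective for major variational quantum algorithms across various tasks.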
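Accelerating neural networks with quantization has been widely studied over the years. Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually restricted by the limited precision support on GPUs (e.g., int1 and int4). To break this restriction, we introduce the first Arbitrary Precision Neural Network framework (APNN-TC) to fully exploit the benefits of quantization on Ampere GPU Tensor Cores. Specifically, APNN-TC first incorporates a novel emulation algorithm that supports arbitrary short bit-width computation with int1 compute primitives and XOR/AND Boolean operations. Second, APNN-TC integrates arbitrary-precision layer designs that efficiently map the emulation algorithm to Tensor Cores with novel batching strategies and specialized memory organization. Third, APNN-TC embodies a novel arbitrary-precision NN design that minimizes memory access across layers and further improves performance. Extensive evaluations show that APNN-TC achieves significant speedup over CUTLASS kernels and on various NN models, such as ResNet and VGG.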
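The idea of emulating short bit-width arithmetic with 1-bit primitives can be illustrated with a small, CPU-only sketch. It is not the APNN-TC algorithm itself: it assumes unsigned integer weights and activations, uses only AND plus a population count (the XOR case, typically needed for ±1 encodings, is not covered), and the function names `bit_planes`, `pack`, and `emulated_dot` are hypothetical.

```python
def bit_planes(values, bits):
    """Split unsigned integers into `bits` binary vectors, least significant bit first."""
    return [[(v >> b) & 1 for v in values] for b in range(bits)]


def pack(bit_vector):
    """Pack a list of 0/1 values into one Python integer (element i -> bit i)."""
    word = 0
    for i, bit in enumerate(bit_vector):
        word |= bit << i
    return word


def emulated_dot(weights, activations, w_bits, a_bits):
    """Dot product of unsigned w_bits-bit and a_bits-bit vectors using 1-bit operations.

    Each pair of bit planes is combined with AND and a population count,
    then scaled by the combined bit significance 2**(i + j).
    """
    w_planes = [pack(p) for p in bit_planes(weights, w_bits)]
    a_planes = [pack(p) for p in bit_planes(activations, a_bits)]
    total = 0
    for i, w_plane in enumerate(w_planes):
        for j, a_plane in enumerate(a_planes):
            total += bin(w_plane & a_plane).count("1") << (i + j)
    return total


# 1-bit weights with 2-bit activations, the mixed-precision case named in the abstract.
weights = [1, 0, 1, 1]
activations = [3, 2, 1, 0]
result = emulated_dot(weights, activations, w_bits=1, a_bits=2)
```

In this sketch the number of 1-bit plane products grows as `w_bits * a_bits`, which is why such emulation is attractive mainly for short bit widths.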
As one of the prevalent methods for building automation systems, Imitation Learning (IL) shows promising performance across a wide range of domains. However, despite considerable improvements in policy performance, research on the explainability of IL models remains limited. Inspired by recent approaches in explainable artificial intelligence, we propose a model-agnostic explanation framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in demonstrations. It iteratively retrains the black-box IL model on randomly masked demonstrations and uses the conventional evaluation outcome, the environment returns, as the coefficient to build an importance map. We also conduct experiments to investigate three major questions concerning the equality of frames' importance, the effectiveness of the importance map, and the connections between importance maps from different IL models. The results show that R2RISE successfully distinguishes important frames from the demonstrations.
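The masking-and-weighting loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `evaluate_masked` is a hypothetical stand-in for "retrain the IL model on the masked demonstrations and measure the environment return", and averaging the returns per kept frame is one possible normalization, not necessarily the one R2RISE uses.

```python
import random


def importance_map(num_frames, evaluate_masked, num_masks=200, keep_prob=0.5, seed=0):
    """Estimate per-frame importance from randomly masked demonstrations.

    evaluate_masked(mask) is a hypothetical callback: it should retrain the
    black-box model on the frames where mask[i] == 1 and return the resulting
    environment return as a float.
    """
    rng = random.Random(seed)
    weighted = [0.0] * num_frames  # sum of returns over masks that kept frame i
    counts = [0] * num_frames      # number of masks that kept frame i
    for _ in range(num_masks):
        mask = [1 if rng.random() < keep_prob else 0 for _ in range(num_frames)]
        score = evaluate_masked(mask)
        for i, kept in enumerate(mask):
            if kept:
                weighted[i] += score
                counts[i] += 1
    # Average return observed when each frame was kept (assumed normalization).
    return [w / c if c else 0.0 for w, c in zip(weighted, counts)]


# Toy stand-in for retraining: the return depends only on whether frame 2 is kept.
toy_importance = importance_map(5, lambda mask: 10.0 * mask[2])
```

In the toy run, frame 2 receives the highest score because the return is non-zero only when that frame survives the mask.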
Research interest in sequential recommender systems, which aim to model dynamic sequence representations precisely, is growing. However, the loss functions most commonly used in state-of-the-art sequential recommendation models have essential limitations. To name a few, Bayesian Personalized Ranking (BPR) loss suffers from the vanishing gradient problem caused by numerous negative samples and prediction biases; Binary Cross-Entropy (BCE) loss is subject to the number of negative samples, so it is likely to ignore valuable negative examples and reduce training efficiency; Cross-Entropy (CE) loss focuses only on the last timestamp of the training sequence, which leads to low utilization of sequence information and an inferior user sequence representation. To avoid these limitations, in this paper we propose to calculate a Cumulative Cross-Entropy (CCE) loss over the sequence. CCE is simple and direct, and it enjoys the virtues of painless deployment, no negative sampling, and effective and efficient training. We conduct extensive experiments on five benchmark datasets to demonstrate the effectiveness and efficiency of CCE. The results show that employing CCE loss on three state-of-the-art models, GRU4Rec, SASRec, and S3-Rec, yields average improvements of 125.63%, 69.90%, and 33.24% in full-ranking NDCG@5, respectively. With CCE, the performance curve of the models on the test data rises rapidly with wall-clock time and is superior to that of other loss functions throughout almost the whole training process.
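The contrast between a last-timestamp CE loss and a loss accumulated over the whole sequence can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the model emits one vector of item logits per position, and it sums the per-position losses without normalization, which is an assumption. The names `last_step_ce` and `cumulative_ce` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def last_step_ce(step_logits, next_items):
    """Cross-entropy computed at the final position of the sequence only."""
    return -log_softmax(step_logits[-1])[next_items[-1]]


def cumulative_ce(step_logits, next_items):
    """Cross-entropy summed over every position of the sequence."""
    return sum(
        -log_softmax(logits)[target]
        for logits, target in zip(step_logits, next_items)
    )


# Three positions, four candidate items, uniform logits at each position.
step_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
next_items = [1, 3, 2]  # index of the true next item at each position
ce_last = last_step_ce(step_logits, next_items)
ce_cumulative = cumulative_ce(step_logits, next_items)
```

Both functions score all items through the softmax, so no negative sampling is involved; the cumulative variant simply draws a training signal from every position instead of one.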
In the black-box adversarial attack scenario, the target model's parameters are unknown, and the attacker aims to find a successful adversarial perturbation based on query feedback under a query budget. Because the feedback information is limited, existing query-based black-box attack methods often require many queries to attack each benign example. To reduce the query cost, we propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability. Specifically, by treating the attack on each benign example as one task, we develop a meta-learning framework that trains a meta-generator to produce perturbations conditioned on benign examples. When attacking a new benign example, the meta-generator can be quickly fine-tuned based on the feedback information of the new task as well as a few historical attacks to produce effective perturbations. Moreover, since the meta-training procedure consumes many queries to learn a generalizable generator, we utilize model-level adversarial transferability to train the meta-generator on a white-box surrogate model and then transfer it to help the attack against the target model. The proposed framework, with its two types of adversarial transferability, can be naturally combined with any off-the-shelf query-based attack method to boost its performance, which is verified by extensive experiments.
A storyboard is a roadmap for video creation that consists of shot-by-shot images visualizing the key plots in a text synopsis. Creating video storyboards, however, remains challenging: it requires not only association between high-level texts and images but also long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS), which aims to retrieve an ordered sequence of images to visualize the text synopsis. We construct a MovieNet-TeViS benchmark based on the public MovieNet dataset. It contains 10K text synopses, each paired with keyframes that are manually selected from the corresponding movies by considering both relevance and cinematic coherence. We also present an encoder-decoder baseline for the task. The model uses a pretrained vision-and-language model to improve high-level text-image matching. To improve coherence across long-term shots, we further propose to pre-train the decoder on large-scale movie frames without text. Experimental results demonstrate that our proposed model significantly outperforms other models in creating text-relevant and coherent storyboards. Nevertheless, there is still a large gap compared to human performance, suggesting room for promising future work.